The systematic collection of speech corpora for all eleven official South African languages
Abstract
In this paper we outline methods and best practices for collecting speech data for under-resourced languages. The discussion focuses on ways to improve the quality of the collection and shorten turnaround time. We show how to deal with matters concerning assistants and technical problems, and suggest techniques for optimising data management. The article aims to give the reader a complete overview of the improvements made during the course of a real data collection project with tangible problems and results.
Similar resources
Spelling Checker-based Language Identification for the Eleven Official South African Languages
Language identification is often the first step when compiling corpora from web pages or other unstructured sources. In this paper, an effective and accurate method for identification of all eleven official South African languages is presented. The method is based on reusing commercial spelling checkers and consists of a multi-stage architecture that is described in detail. We describe the impl...
Collecting and evaluating speech recognition corpora for 11 South African languages
We describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which contains data from the eleven official languages of South Africa. Because of practical constraints, the amount of speech per language is relatively small compared to major corpora in world languages, and we report on our investigation of the stability of the ASR models derived from the corpu...
The NCHLT speech corpus of the South African languages
The NCHLT speech corpus contains wide-band speech from approximately 200 speakers per language, in each of the eleven official languages of South Africa. We describe the design and development processes that were undertaken in order to develop the corpus, and report on associated materials such as orthographic transcriptions and pronunciation dictionaries that were released as part of the corpu...
Rapid Development of TTS Corpora for Four South African Languages
This paper describes the development of text-to-speech corpora for four South African languages. The approach followed investigated the possibility of using low-cost methods including informal recording environments and untrained volunteer speakers. This objective and the additional future goal of expanding the corpus to increase coverage of South Africa’s 11 official languages necessitated exp...
African speech technology (AST) telephone speech databases: corpus design and contents
The African Speech Technology project is developing telephone speech databases for five of South Africa’s eleven official languages, i.e. South African English, Afrikaans, and three African languages, Zulu, Xhosa, and Southern Sotho. These databases will be fully transcribed – orthographically and phonetically – and will be used for the training and testing of phoneme-based, speaker-independent...